Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use np.asarray to convert a 1-D array to datetime type in array_to_datetime #2481

Merged
merged 8 commits into from
Apr 6, 2023

Conversation

seisman
Copy link
Member

@seisman seisman commented Apr 5, 2023

Description of proposed changes

Address #2480.

The initial purpose of this PR is to fix the failing doctests as mentioned in #2480, but in comment #2481 (comment) I decided to refactor the array_to_datetime function.


This PR fixes the following failing test:

______________ [doctest] pygmt.clib.conversion.array_to_datetime _______________
276 
277     Examples
278     --------
279     >>> import datetime
280     >>> # numpy.datetime64 array
281     >>> x = np.array(
282     ...     ["2010-06-01", "2011-06-01T12", "2012-01-01T12:34:56"],
283     ...     dtype="datetime64",
284     ... )
285     >>> array_to_datetime(x)
Differences (unified diff with -expected +actual):
    @@ -1,3 +1,3 @@
     DatetimeIndex(['2010-06-01 00:00:00', '2011-06-01 12:00:00',
                    '2012-01-01 12:34:56'],
    -              dtype='datetime64[ns]', freq=None)
    +              dtype='datetime64[s]', freq=None)

It's caused by incompatible changes as documented in
https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution.

Fixes #

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.
  • Use underscores (not hyphens) in names of Python files and directories.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash commands are:

  • /format: automatically format and lint the code
  • /test-gmt-dev: run full tests on the latest GMT development version

@seisman seisman marked this pull request as draft April 5, 2023 02:42
@seisman
Copy link
Member Author

seisman commented Apr 5, 2023

A new failing test after the above fix:

______________ [doctest] pygmt.clib.conversion.array_to_datetime _______________
303     >>> x = [
304     ...     "2018",
305     ...     "2018-02",
306     ...     "2018-03-01",
307     ...     "2018-04-01T01:02:03",
308     ...     "5/1/2018",
309     ...     "Jun 05, 2018",
310     ...     "2018/07/02",
311     ... ]
312     >>> array_to_datetime(x)
UNEXPECTED EXCEPTION: ValueError('unconverted data remains when parsing with format "%Y": "-02", at position 1. You might want to try:\n    - passing `format` if your strings have a consistent format;\n    - passing `format=\'ISO8601\'` if your strings are all ISO8601 but not necessarily in exactly the same format;\n    - passing `format=\'mixed\'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.')
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/doctest.py", line 1336, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest pygmt.clib.conversion.array_to_datetime[8]>", line 1, in <module>
  File "/home/runner/work/pygmt/pygmt/pygmt/clib/conversion.py", line 329, in array_to_datetime
    return pd.to_datetime(array)
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 1082, in to_datetime
    result = convert_listlike(argc, format)
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 453, in _convert_listlike_datetimes
    return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 484, in _array_strptime_with_fallback
    result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
  File "pandas/_libs/tslibs/strptime.pyx", line 530, in pandas._libs.tslibs.strptime.array_strptime
  File "pandas/_libs/tslibs/strptime.pyx", line 355, in pandas._libs.tslibs.strptime.array_strptime
ValueError: unconverted data remains when parsing with format "%Y": "-02", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

It's caused by the incompatible changes in https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#datetimes-are-now-parsed-with-a-consistent-format.

In the past, to_datetime() guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats - however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

In our case, we're passing datetimes in different formats, but pandas tries to parse all datetimes using a consistent format %Y, which is guessed based on the first datetime (i.e., 2018).

@seisman
Copy link
Member Author

seisman commented Apr 5, 2023

A new failing test after the above fix:

______________ [doctest] pygmt.clib.conversion.array_to_datetime _______________
303     >>> x = [
304     ...     "2018",
305     ...     "2018-02",
306     ...     "2018-03-01",
307     ...     "2018-04-01T01:02:03",
308     ...     "5/1/2018",
309     ...     "Jun 05, 2018",
310     ...     "2018/07/02",
311     ... ]
312     >>> array_to_datetime(x)
UNEXPECTED EXCEPTION: ValueError('unconverted data remains when parsing with format "%Y": "-02", at position 1. You might want to try:\n    - passing `format` if your strings have a consistent format;\n    - passing `format=\'ISO8601\'` if your strings are all ISO8601 but not necessarily in exactly the same format;\n    - passing `format=\'mixed\'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.')
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/doctest.py", line 1336, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest pygmt.clib.conversion.array_to_datetime[8]>", line 1, in <module>
  File "/home/runner/work/pygmt/pygmt/pygmt/clib/conversion.py", line 329, in array_to_datetime
    return pd.to_datetime(array)
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 1082, in to_datetime
    result = convert_listlike(argc, format)
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 453, in _convert_listlike_datetimes
    return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
  File "/usr/share/miniconda3/envs/pygmt/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 484, in _array_strptime_with_fallback
    result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
  File "pandas/_libs/tslibs/strptime.pyx", line 530, in pandas._libs.tslibs.strptime.array_strptime
  File "pandas/_libs/tslibs/strptime.pyx", line 355, in pandas._libs.tslibs.strptime.array_strptime
ValueError: unconverted data remains when parsing with format "%Y": "-02", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

It's caused by the incompatible changes in https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#datetimes-are-now-parsed-with-a-consistent-format.

In the past, to_datetime() guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats - however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

In our case, we're passing datetimes in different formats, but pandas tries to parse all datetimes using a consistent format %Y, which is guessed based on the first datetime (i.e., 2018).

Here are some possible options:

1. Always passing format="mixed" to the pd.to_datetime call

return pd.to_datetime(array)

so that the format will be inferred for each element individually. This is the default behavior in pandas < 2.0.0. However, format="mixed" is not available for pandas < 2.0.0, so we have to pin pandas to >=2.0.0 or check the pandas version.
2. Just remove the failing test, which also means we no longer support a list of datetimes with mixed formats.
3. Let array_to_datetime support **kwargs so that we can pass any keyword arguments to pd.to_dataframe. But the array_to_datetime function is only internally used by PyGMT, so users who pass a list of datetimes with mixed formats will still see the same error.

I'm inclined to option 2. What do you think @GenericMappingTools/pygmt-maintainers?

@seisman
Copy link
Member Author

seisman commented Apr 5, 2023

The array_to_datetime function was added in PR #464 to support various datetime types and is used only once in the put_vector function:

pygmt/pygmt/clib/session.py

Lines 797 to 805 in aaa0ca0

gmt_type = self._check_dtype_and_dim(vector, ndim=1)
if gmt_type in (self["GMT_TEXT"], self["GMT_DATETIME"]):
vector_pointer = (ctp.c_char_p * len(vector))()
if gmt_type == self["GMT_DATETIME"]:
vector_pointer[:] = np.char.encode(
np.datetime_as_string(array_to_datetime(vector))
)
else:
vector_pointer[:] = np.char.encode(vector)

Please note that before calling array_to_datetime(vector), we also call _check_dtype_and_dim(vector) (at line 797).

In the _check_dtype_and_dim function, we use np.asarray(array, dtype=np.datetime64) to check if a list/array can be recognized as np.datetime64 (see below).

pygmt/pygmt/clib/session.py

Lines 744 to 752 in aaa0ca0

# Check that the array has a valid/known data type
if array.dtype.type not in DTYPES:
try:
# Try to convert any unknown numpy data types to np.datetime64
array = np.asarray(array, dtype=np.datetime64)
except ValueError as e:
raise GMTInvalidInput(
f"Unsupported numpy data type '{array.dtype.type}'."
) from e

Thus, if a list/vector can't be processed by np.asarray(array, dtype=np.datetime64), it will not be recognized as a valid datetime list/vector and won't be processed by array_to_datearray.

In short, I think we should use np.asarray(array, dtype=np.datetime64) instead of pd.to_datetime() to do the conversion.

Comment on lines -308 to -310
... "5/1/2018",
... "Jun 05, 2018",
... "2018/07/02",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three types of datetime inputs are no longer supported. Actually, they're never supported because they won't be recognized as valid datetime strings in the _check_dtype_and_dim function.

@seisman seisman force-pushed the pandas/2.0.0/datetime64 branch from 97abc5f to 2ac723e Compare April 5, 2023 13:28
@seisman
Copy link
Member Author

seisman commented Apr 5, 2023

In short, I think we should use np.asarray(array, dtype=np.datetime64) instead of pd.to_datetime() to do the conversion.

I've made the change in commit 608e29b and also use the same function in _check_dtype_and_dim (commit 2ac723e). Now the tests pass and the datetime chart tutorial (https://pygmt-dev--2481.org.readthedocs.build/en/2481/tutorials/advanced/date_time_charts.html) looks good.

@seisman seisman added the maintenance Boring but important stuff for the core devs label Apr 5, 2023
@seisman seisman added this to the 0.10.0 milestone Apr 5, 2023
@seisman seisman added the needs review This PR has higher priority and needs review. label Apr 5, 2023
@seisman seisman changed the title Update the array_to_datetime doctest for pandas 2.0.0 Use np.asarray to convert a 1-D array to datetime type in array_to_datetime Apr 5, 2023
@seisman seisman changed the title Use np.asarray to convert a 1-D array to datetime type in array_to_datetime Use np.asarray to convert a 1-D array to datetime type in array_to_datetime Apr 5, 2023
@seisman seisman marked this pull request as ready for review April 5, 2023 14:06
@seisman seisman requested a review from a team April 6, 2023 00:35
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
"""
return pd.to_datetime(array)
return np.asarray(array, dtype=np.datetime64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there was a reason we used pd.to_datetime instead of np.asarray in #464, maybe to support funny dates like 'Jun 05, 2018', but since the tests pass and you mentioned in https://github.com/GenericMappingTools/pygmt/pull/2481/files#r1158514955 that it's not actually supported by _check_dtype_and_dim, then it should be fine 🤞

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there was a reason we used pd.to_datetime instead of np.asarray in #464, maybe to support funny dates like 'Jun 05, 2018',

In the Plotting datetime charts tutorial, we probably need to add a paragraph or a subsection to explain that, for non-ISO datetimes like 'Jun 05, 2018', people should use pd.to_datetime to process it first before passing it to PyGMT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under https://www.pygmt.org/v0.9.0/tutorials/advanced/date_time_charts.html#generating-an-automatic-region, pd.to_datetime is used to convert dates like "20220712" into a recognizable format. Maybe add a few sentences somewhere there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seisman
Copy link
Member Author

seisman commented Apr 6, 2023

For mysterious reasons, suddenly mamba now installs xarray v0.19.0 which was released more than 1 year ago, and doesn't work with pandas 2.0.0.

@seisman
Copy link
Member Author

seisman commented Apr 6, 2023

For mysterious reasons, suddenly mamba now installs xarray v0.19.0 which was released more than 1 year ago, and doesn't work with pandas 2.0.0.

Related to pydata/xarray#7716.

@weiji14
Copy link
Member

weiji14 commented Apr 6, 2023

For mysterious reasons, suddenly mamba now installs xarray v0.19.0 which was released more than 1 year ago, and doesn't work with pandas 2.0.0.

Related to pydata/xarray#7716.

Yeah, looks like they did a repodata patch at conda-forge/conda-forge-repodata-patches-feedstock#426 for xarray versions 2022.10.0 to 2023.2.0 at (i.e. the CalVer versions), but they missed the older SemVer versions (e.g. 0.19.0) so we'll need to wait for conda-forge/conda-forge-repodata-patches-feedstock#429 🤷

@seisman seisman mentioned this pull request Apr 6, 2023
7 tasks
@seisman seisman added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Apr 6, 2023
examples/tutorials/advanced/date_time_charts.py Outdated Show resolved Hide resolved
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
seisman and others added 2 commits April 6, 2023 16:35
@seisman seisman merged commit ab3d689 into main Apr 6, 2023
@seisman seisman deleted the pandas/2.0.0/datetime64 branch April 6, 2023 10:57
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants